8 research outputs found

    Computational pan-genomics: status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

    Computational pan-genomics: Status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations

    Towards comparative pan-genomics

    No full text
    Comparative genomics investigates the genomic makeup of species to unravel their unique variations and evolutionary relationships. High-throughput sequencing technologies have enabled reading the DNA content of a wide variety of species at an unprecedented rate. With the ongoing advances in these technologies, many species are or will soon be represented by a large number of genomes. Such genomes can be highly similar, but their differences in sequence and structure are of interest in many applications as they usually underlie specific traits. Given a wealth of genomes for a species, the current practice of basing comparative studies on a single reference genome is neither efficient nor effective. Traditional reference-based approaches make use of only a single reference genome, ignoring the potentially novel genomic content found in other individuals. As a result, over the last decade there has been a growing interest in developing pan-genome structures capable of capturing a wide genomic landscape of species. In this thesis, we develop a pan-genomic platform based on a novel representation of genomes, with functionality for sequence retrieval, structural annotation, homology detection and read mapping.
    Chapter 1 briefly introduces molecular biology and the revolution in genome sequencing. We then introduce evolution and some basic concepts in genomics and comparative genomics which readers need in order to follow the chapters of this thesis. We emphasize the shortcomings of traditional reference-based approaches in comparative genomics and introduce pan-genomics as a solution which has recently received much attention. We introduce the essentials of a pan-genomic platform from the perspective of the Computational Pan-genomics Consortium, and classify existing pan-genomic data structures into two general categories: variation-aware and multi-genome data structures. Finally, we discuss the de Bruijn graph, including the stranded version we introduce in Chapter 2.
    Chapter 2 highlights the necessity of a transition from reference-centric to pan-genomic approaches. As a comprehensive representation of a large number of genomes, we introduce a generalized de Bruijn graph (DBG). We present a novel algorithm to construct such a DBG and take advantage of the Neo4j graph database for consistent and scalable storage of the graph. We develop a toolset, called PanTools, which provides functionality for, e.g., annotation, graph update and sequence retrieval. We demonstrate the performance of PanTools on large datasets of bacterial, fungal and plant genomes. We illustrate how sequence variation creates specific sub-structures in the pan-genome, including an example of the variability of a well-known gene, FRIGIDA, among 19 A. thaliana accessions.
    Chapter 3 emphasizes the need for highly efficient tools to detect homology in the ever-increasing genomic data. We present an efficient method for detecting homology across a large number of individuals at various evolutionary distances. The presented k-mer based approach considerably reduces the number of alignments between pairs of peptide sequences without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method in large proteomes of bacteria, fungi, plants and Metazoa. The detected homology groups are stored in the pan-genome graph database, and can be queried, for example, for their size, copy number and conservation rate.
    Chapter 4 focuses on correcting errors in next-generation sequencing reads, which can improve the performance of assembly and increase the accuracy and sensitivity of quantitative analyses such as differential expression analyses and variant calling. We develop a tool, called ACE, based on a k-mer trie data structure to correct for substitution errors in short read data. We show that ACE yields higher gains in terms of coverage depth, outperforming state-of-the-art competitors in the majority of cases, on both MiSeq and HiSeq Illumina data.
    Chapter 5 presents a multi-genome read mapping approach which utilizes the index and pan-genome structure introduced in Chapter 2 to map short reads to a large number of genomes simultaneously. One advantage is efficiency, as the joint index enables anchoring the reads to all the genomes at once, avoiding repetitive alignments when the genomes are highly similar. Another advantage is that we can resolve the reference bias by including regions that are entirely missing in the reference but present in some other accessions. Moreover, such a multi-genome read mapper can be utilized in binning and abundance estimation of metagenomic samples. In this chapter, we successfully apply this approach to map genomic and metagenomic reads to large collections of viral, archaeal, bacterial, fungal and plant genomes.
    Chapter 6 puts forward some ideas on the future challenges and opportunities in the field of pan-genomics. We discuss the emerging shift from reference-centric to pan-genomic approaches and the necessity of substantial adjustments and redevelopments of traditional methods and applications such as genome annotation, structural variation detection and real-time pan-genome visualization. We conclude that the design and engineering introduced in this thesis contribute to the field, and that the growing number of similar efforts indicates a bright future ahead for comparative pan-genomics.
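
    To make the idea of a multi-genome de Bruijn graph concrete, the following is a minimal Python sketch and not the thesis's construction algorithm: it ignores the reverse strand, does not compact the graph, and keeps everything in memory rather than in Neo4j. The function name `build_pan_dbg`, the toy sequences and the choice of k = 3 are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_pan_dbg(genomes, k):
    """Build a toy multi-genome de Bruijn graph (forward strand only).

    genomes: dict mapping a genome name to its sequence (hypothetical input)
    Returns (nodes, edges) where
      nodes maps each k-mer to the set of genomes that contain it, and
      edges maps each k-mer to the set of k-mers that directly follow it.
    """
    nodes = defaultdict(set)
    edges = defaultdict(set)
    for name, seq in genomes.items():
        prev = None
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            nodes[kmer].add(name)        # record which genome uses this node
            if prev is not None:
                edges[prev].add(kmer)    # consecutive k-mers overlap by k-1 bases
            prev = kmer
    return dict(nodes), dict(edges)


# Two toy "genomes" that differ by a single substitution (T vs. A).
genomes = {"g1": "ACGTACCT", "g2": "ACGAACCT"}
nodes, edges = build_pan_dbg(genomes, k=3)

# k-mers present in every genome form the shared backbone of the graph.
shared = {kmer for kmer, names in nodes.items() if len(names) == len(genomes)}
```

    In this toy input the substitution makes the node `ACG` branch into two successors, one per genome, and the paths rejoin at `ACC`. This is the kind of sub-structure that sequence variation creates in a pan-genome graph.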

    ACE: accurate correction of errors using K

    No full text

    Efficient inference of homologs in large eukaryotic pan-proteomes

    No full text
    BACKGROUND: Identification of homologous genes is fundamental to comparative genomics, functional genomics and phylogenomics. Extensive public homology databases are of great value for investigating homology but need to be continually updated to incorporate new sequences. As new sequences are rapidly being generated, there is a need for efficient standalone tools to detect homologs in novel data. RESULTS: To address this, we present a fast method for detecting homology groups across a large number of individuals and/or species. We adopted a k-mer based approach which considerably reduces the number of pairwise protein alignments without sacrificing sensitivity. We demonstrate accuracy, scalability, efficiency and applicability of the presented method for detecting homology in large proteomes of bacteria, fungi, plants and Metazoa. CONCLUSIONS: We clearly observed the trade-off between recall and precision in our homology inference. Favoring recall or precision strongly depends on the application. The clustering behavior of our program can be optimized for particular applications by altering a few key parameters. The program is available for public use at https://github.com/sheikhizadeh/pantools as an extension to our pan-genomic analysis tool, PanTools.
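
    The general idea of using shared k-mers to avoid aligning every pair of proteins can be sketched as follows. This is an illustrative Python sketch and not the PanTools implementation: the function name `candidate_pairs`, the parameters `k` and `min_shared`, and the toy sequences are all hypothetical. Only pairs that share enough k-mers would be passed on to a (much more expensive) alignment step.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(proteins, k=3, min_shared=2):
    """Return protein pairs that share at least `min_shared` distinct k-mers.

    proteins: dict mapping a protein name to its amino-acid sequence
    (hypothetical input). Pairs not returned are never aligned.
    """
    # Inverted index: k-mer -> names of the proteins containing it.
    index = defaultdict(set)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    # Count, for each pair, how many distinct k-mers the two proteins share.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1

    return {pair for pair, n in shared.items() if n >= min_shared}


proteins = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYLAKQR",   # one substitution relative to p1
    "p3": "GGSWWPLHNE",   # unrelated sequence
}
pairs = candidate_pairs(proteins)
```

    With these toy sequences, only one of the three possible pairs survives the filter, so only that pair would need to be aligned. Raising `min_shared` trades recall for precision, in the spirit of the trade-off the abstract describes.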